Abstract

Multiplication and accumulation are the vital operations in almost all Digital Signal
Processing applications. Add-Multiply (AM) operators or Multiply-Accumulator (MAC) units are generally
employed in all high performance digital signal processors (DSP) and controllers. The performance of the AM
operator mainly depends on the speed of its multiplier. A lot of research has been contributed in this area, and
conventional multipliers have been modified to provide good speed performance, but they need to be improved
further along with area optimization. The Urdhva-Tiryakbhyam Multiplier (UTM) architecture is adopted from the
ancient Indian mathematics of the "Vedas" and can generate the partial products and sums in one step, which
reduces the carry propagation from LSB to MSB. UTM can be used to implement high performance AM operators but
results in a larger silicon area. This increased area can be minimized by using the modified compressor based
design of UTM. In this work, a carry-look-ahead (CLA) adder is adopted instead of parallel adders for high
speed accumulation. The Compressor-Based Urdhva-Tiryakbhyam (CB-UT) multiplier with CLA thus results in
both area and performance optimization of the Add-Multiply operator. The architecture is
evaluated by comparing it with the Modified Booth (MB) multiplier based AM operator in terms of performance
parameters such as propagation delay, power consumption and silicon area. The design is implemented and
verified using a Xilinx Spartan-3E FPGA and the ISE Simulator.

KEYWORDS: AM operator, CB-UT multiplier, Urdhva-Tiryakbhyam multiplier

I. Introduction 

Digital Signal Processing (DSP) systems enhance the performance of modern consumer
electronics by providing custom accelerators for domains such as multimedia and communications. DSP
systems are typically employed in applications with a large number of arithmetic operations, as their
implementation is based on computationally intensive kernels such as the Fast Fourier Transform (FFT),
Discrete Cosine Transform (DCT), Finite Impulse Response (FIR) filters and convolution. As
expected, the performance of DSP systems is mostly affected by design decisions regarding
the allocation and the architecture of arithmetic units such as the Multiply-Accumulator (MAC) unit.
Recent research in the field of arithmetic optimization of DSP systems [16], [11] has
shown that the design of arithmetic units can lead to significant performance improvements by
combining operations which share data. In widely used DSP algorithms,
an addition is often subsequent to a multiplication. The Multiply-Accumulator (MAC),
Multiply-Add (MAD) and Add-Multiply (AM) units were introduced to provide more efficient
implementations of DSP systems than conventional architectures, which use only
primitive resources [9]. A lot of research has been contributed in this area, and several architectures
have been proposed to optimize the performance of MAC units in terms of primary design
constraints such as area occupation, critical path delay or power consumption [10]-[12]. As noted in [13],
MAC components can increase the flexibility of DSP systems' data path synthesis, as a large set of
arithmetic operations can be efficiently mapped onto them. The MAC unit contains adders and
multipliers, and its speed greatly depends on the performance of its multiplier. This in turn
raises the demand for multipliers with high speed performance that at the same time
maintain low area and moderate power dissipation [2]. The straightforward design of the MAC
unit, which first allocates an adder and then drives its output to the input of a multiplier, significantly
increases both the area and the critical path delay of the circuit.

Over the past few decades, several novel multiplier architectures have been developed to improve
performance in terms of area, power and delay. Booth's [4] and modified Booth's algorithm based
multipliers are widely used in modern Very Large Scale Integrated (VLSI) design but have their own
set of disadvantages. In these algorithms, the multiplication process involves several
intermediate operations before the final result is calculated. The intermediate stages
include several additions, subtractions and comparisons to compute the product, which reduce the
speed of operation exponentially with the total number of bits present in the multiplier and the
multiplicand [5]. Since speed is the major design constraint, multipliers with this type of
architecture are not a good approach for designing MAC units, because they involve several time consuming
intermediate operations. To address these disadvantages of the existing
methods in terms of speed of operation, [6] explored a new approach to multiplier architecture based
on ancient Vedic mathematics. Vedic mathematics is an ancient and eminent Indian approach which
provides a simple foundation for solving many mathematical challenges faced at present [2].
Vedic mathematics existed in ancient India and was rediscovered by a popular Indian mathematician,
Sri Bharati Krishna Tirthaji [7]. Tirthaji divided Vedic mathematics into 16 simple sutras
(formulae). These sutras deal with many basic calculations in arithmetic, analysis, algebra,
geometry, trigonometry, etc. The simplicity of the Vedic mathematics sutras opens the way for their
application in several prominent domains of engineering and technology. These sutras are widely
useful in signal processing, control engineering, communication systems and VLSI [14].

One of the highlights of the Vedic mathematics approach is that all partial products
required for the multiplication are obtained in advance with a fixed hardware architecture, before
the actual multiplication operations begin. The generated intermediate partial products are then
added according to the Vedic mathematics algorithm to obtain the final product. This results in a very
high speed approach to multiplication [15].

In this work, a novel method is proposed to further enhance the speed of the AM operation by
designing it using the Urdhva-Tiryakbhyam (UT) Vedic multiplier. The Vedic multiplier is
optimized by replacing the existing full adders and half adders of the multiplier with compressor
based (CB) adders. Compressors are logic circuits which are capable of adding more than 3 bits at a
time, as opposed to a full adder, and can be designed with a lower gate count and higher speed in
comparison with an equivalent full adder circuit; they exist in several variants [15]. The
Carry Look Ahead (CLA) adder is adopted instead of parallel adders for high speed
accumulation during the MAC operation. The Compressor Based Urdhva-Tiryakbhyam (CB-UT)
multiplier with CLA thus provides both area and performance optimization of the AM architecture.

The performance of the proposed CB-UT multiplier based AM operator is evaluated by comparing it with the
existing Modified Booth multiplier based AM operator in terms of area, power consumption and
speed performance.

The rest of the paper is organized as follows. In section II, the motivation and technical background
of the existing Modified Booth multiplier based AM operator are discussed. The proposed
CB-UT multiplier based AM operator is presented in section III along with its design methodology.
Section IV shows the performance evaluation of the proposed method by comparison with the existing AM
operator, and section V concludes the work.

II. Motivation and AM implementation 

2.1. Motivation 

In this paper, the AM unit is implemented with speed performance as its main design
constraint. The conventional architecture of the AM operator requires that its inputs are first driven to
an adder and that the remaining input and the sum are then driven to a multiplier to obtain the final result. An
optimized design of the AM operator therefore depends on the speed of the multiply operation.
In many existing multiplier architectures, the design can be optimized for speed or for area but not for both at
once: optimizing speed increases area and vice versa. To address this problem, the multiplier can be designed
using the UTM architecture, and its area can then be minimized by designing the internal adder blocks using the
compressor based design. Another drawback of the existing AM operator design is that the
conventional adder inserts a significant delay in the critical path of the AM operator. As
carry signals have to be propagated inside the n-bit adder, the critical path depends on the bit-width of
the inputs. To reduce this delay, a Carry-Look-Ahead (CLA) adder is used in this work;
however, it results in increased area occupation and power dissipation. This adder architecture can
also be optimized for area and power by implementing it using the compressor based design. As a result,
significant area savings are observed along with very high speed performance. In this work, a new
technique is proposed that combines the advantages of the Vedic UTM, the CLA adder and the compressor
based architecture for an optimized AM operator.
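
The carry-look-ahead principle mentioned above can be illustrated with a small bit-level model. This is only a behavioural sketch in Python, not the Verilog design used in this work; the function name `cla_add` and the 16-bit default width are assumptions made for illustration. Each carry is written directly in terms of the generate and propagate signals and the carry-in, so no carry depends on a previously computed carry.

```python
def cla_add(a: int, b: int, width: int = 16, cin: int = 0):
    """Bit-level model of a carry-look-ahead adder. Returns (sum, carry_out)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # generate:  a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate: a_i XOR b_i

    def carry_into(i: int) -> int:
        # Expanded look-ahead form (no dependence on earlier carries):
        # c_i = g[i-1] | p[i-1]g[i-2] | ... | p[i-1]..p[1]g[0] | p[i-1]..p[0]cin
        terms = []
        for j in range(i):
            t = g[j]
            for k in range(j + 1, i):
                t &= p[k]
            terms.append(t)
        t = cin
        for k in range(i):
            t &= p[k]
        terms.append(t)
        return int(any(terms))

    total = 0
    for i in range(width):
        total |= (p[i] ^ carry_into(i)) << i   # sum bit: p_i XOR c_i
    return total, carry_into(width)
```

In hardware the expanded carry terms are evaluated in parallel, which is what removes the bit-width dependent ripple from the critical path at the cost of extra gates.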

2.2. Review of the Modified Booth Form 


Modified Booth (MB) is a prevalent form used in multiplication operations in many existing
algorithms [1]. The Modified Booth form provides faster results than the conventional Booth form by halving
the total number of partial products compared to any radix-2 representation. The MB form is a
redundant signed-digit radix-4 encoding technique. Consider the multiplication of two binary
numbers X and Y in 2's complement form, each consisting of n = 2k bits. The
multiplicand can be represented in MB form as:

$$Y = \langle y_{n-1}\, y_{n-2} \ldots y_1\, y_0 \rangle_{2's} = -y_{2k-1} \cdot 2^{2k-1} + \sum_{i=0}^{2k-2} y_i \cdot 2^{i} = \sum_{j=0}^{k-1} y_j^{MB} \cdot 2^{2j}$$

Each digit $y_j^{MB} = -2y_{2j+1} + y_{2j} + y_{2j-1}$, with $0 \le j \le k-1$, corresponds to the three consecutive
bits $y_{2j+1}$, $y_{2j}$ and $y_{2j-1}$ with one bit overlapped, initially
considering that $y_{-1} = 0$. Table 1 shows how the digits are formed using the MB encoding technique. Each digit is
represented by three signal bits: the one bit, the two bit and the sign bit. The sign bit shows
whether the digit is negative ($s_j = 1$) or positive ($s_j = 0$). The one bit shows whether the absolute
value of the digit is equal to 1 ($one_j = 1$) or not ($one_j = 0$). The two bit shows whether the absolute value
of the digit is equal to 2 ($two_j = 1$) or not ($two_j = 0$). These three bits are used to calculate the MB
digits by the following relation:

$$y_j^{MB} = (-1)^{s_j} \cdot [one_j + 2 \cdot two_j] \tag{4}$$


TABLE 1. Modified Booth Algorithm encoding

The encoded bit equations can be generated from the values in Table 1 using K-maps. The
one bit is obtained from an XOR operation, and it is used to generate the two bit with one extra AND
operation. The sign bit is obtained directly from $y_{2j+1}$:

$$one_j = y_{2j-1} \oplus y_{2j}$$
$$two_j = (y_{2j+1} \oplus y_{2j}) \cdot \overline{one_j}$$
$$s_j = y_{2j+1} \tag{5}$$

Vol. 9, Issue 3, pp. 321-330

International Journal of Advances in Engineering & Technology, June, 2016.

©IJAET ISSN: 22311963
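
As a check on the encoding, the recoding can be modelled in a few lines of Python. This is a minimal sketch rather than the implementation used in this work: the function names are hypothetical, and the two bit is assumed to use the complement of the one bit, as the standard MB truth table requires. Summing the recoded digits with weights $4^j$ reproduces the original two's complement value.

```python
def mb_encode(y: int, n: int):
    """Radix-4 Modified Booth recoding of an n-bit two's complement value (n even).

    Returns a list of (s_j, one_j, two_j) tuples for j = 0 .. n/2 - 1.
    """
    assert n % 2 == 0
    bits = [(y >> i) & 1 for i in range(n)]     # works for negative Python ints too

    def bit(i: int) -> int:
        return 0 if i < 0 else bits[i]          # y_{-1} = 0

    digits = []
    for j in range(n // 2):
        y_hi, y_mid, y_lo = bit(2 * j + 1), bit(2 * j), bit(2 * j - 1)
        one = y_lo ^ y_mid                      # |digit| == 1
        two = (y_hi ^ y_mid) & (1 - one)        # |digit| == 2 (uses NOT one)
        s = y_hi                                # sign bit
        digits.append((s, one, two))
    return digits


def mb_digit(s: int, one: int, two: int) -> int:
    """Digit value in {-2, -1, 0, 1, 2} from the three signal bits."""
    return (-1) ** s * (one + 2 * two)


def mb_value(digits) -> int:
    """Recombine the MB digits with radix-4 weights."""
    return sum(mb_digit(*d) * 4 ** j for j, d in enumerate(digits))
```

For an 8-bit operand the recoding yields four digits, so only four partial products are needed instead of eight.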

2.3. Urdhva Tiryakbhyam Sutra 

Vedic Mathematics is the ancient methodology of Indian mathematics, which has a unique technique
of calculation. Ancient Indian Vedic mathematics is divided into 16 different sutras to
perform many mathematical calculations. Among these 16 sutras, Urdhva-Tiryakbhyam is the most
preferable and efficient algorithm for multiplication of integers as well as binary numbers. The
term “Urdhva Tiryakbhyam” originates from the two Sanskrit words Urdhva and Tiryakbhyam: “Urdhva”
means “vertically” and “Tiryakbhyam” means “crosswise”. Consider two 8-bit
numbers X (X7-X0) and Y (Y7-Y0), where the indices 0 to 7 run from the Least Significant Bit
(LSB) to the Most Significant Bit (MSB). P (P15-P0) represents the bits of the multiplied
result [1]. The terms P0 to P15 are calculated by adding partial products, which are obtained by
the logical AND operation between the X and Y bits. The bits obtained from equations (1) to (15)
are concatenated into a single 16-bit binary number to produce the final product of the multiplication. The
carry bits generated at the MSB are ignored since they are unnecessary [12].
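
The vertical and crosswise procedure can be sketched as a column-wise sum of AND terms. The Python model below is illustrative only; the function name and the generic width parameter `n` are assumptions. For each result column k it adds all partial products X_i AND Y_(k-i) to the carry from the previous column, keeps the least significant bit as P_k and passes the rest on.

```python
def urdhva_multiply(x: int, y: int, n: int = 8) -> int:
    """Column-wise (vertical and crosswise) multiplication of two n-bit numbers."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    product, carry = 0, 0
    for k in range(2 * n - 1):                 # 15 columns for n = 8
        lo, hi = max(0, k - n + 1), min(k, n - 1)
        # All crosswise AND terms whose weights add up to 2**k.
        column = sum(xb[i] & yb[k - i] for i in range(lo, hi + 1))
        total = column + carry
        product |= (total & 1) << k            # P_k
        carry = total >> 1                     # carried into column k + 1
    product |= carry << (2 * n - 1)            # the last carry becomes the top bit
    return product
```

Because the product of two n-bit numbers always fits in 2n bits, no carry is produced beyond the top bit.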

2.4. Compressor based adder 

A compressor based adder is a combinational circuit which is used to add more than three binary
bits at a time. This work uses 4:2 and 7:2 compressors as building blocks for
designing higher-order compressors. Compressor based architectures can readily replace combinations of
full adders and half adders, which makes the design of high performance processors possible. The
corresponding compressor circuits are shown below.

Fig.1. 3:2 compressor based adder
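
The behaviour of the compressors can be captured by a short bit-level model. The sketch below assumes the common construction of a 4:2 compressor from two cascaded 3:2 compressors; the circuits used in this work may differ in their internal gate structure, and the function names are illustrative.

```python
def compressor_3_2(a: int, b: int, c: int):
    """3:2 compressor (full adder): three equal-weight bits in, (sum, carry) out."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """4:2 compressor built from two cascaded 3:2 compressors.

    Returns (sum, carry, cout); carry and cout both have twice the weight of sum.
    """
    s1, cout = compressor_3_2(x1, x2, x3)      # cout does not depend on cin
    s, carry = compressor_3_2(s1, x4, cin)
    return s, carry, cout
```

Since `cout` is independent of `cin`, a row of 4:2 compressors does not ripple a carry along the row, which is the property that makes them attractive for partial product reduction.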

2.5. Compressor based Urdhva Tiryakbhyam Multiplier 

Urdhva Tiryakbhyam based multiplication requires several full adders and half adders to
add the partial products, which leads to a large propagation delay. To address this problem, a novel
structure is proposed that integrates the compressor architecture into the
existing Urdhva-Tiryakbhyam based architecture in order to reduce unnecessary partial product
additions. The basic architecture for this design is shown in figure 2. As seen from the figure, the
compressor based UTM requires only 12 parallel stages, whereas the conventional Urdhva-Tiryakbhyam
multiplier uses 15 parallel stages. This reduction in stages yields a significant improvement
in both the speed and the area of the multiplier.


Fig.2. Basic Vedic multiplication process

III. Design Methodology 

In this work, the top level AM operator module is designed by following a top-down design
methodology. At the second level, the AM operator is divided into two sub-modules, the multiplier
and the accumulator. These modules are in turn divided into lower level sub-modules until the
gate level modules are reached. The Vedic multiplier is implemented by taking a 2x2 multiplier as its basic
multiplication unit, which consists of 4 AND gates (one per partial product) and 2 half adder modules. This
2x2 multiplier module is designed using compressor based adders to reduce the area, as shown in figure 3.
This 2x2 CBUT multiplier is used for the higher level 4x4 CBUT multiplier design. This bottom-up
implementation process is repeated up to the design of a 16x16 CBUT multiplier using 8x8 multiplier
modules.
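
The bottom-up composition can be sketched with a recursive model. This is a behavioural Python illustration, not the Verilog modules of this work: the function names are hypothetical, and plain integer addition stands in for the compressor based adder modules. An n x n multiplier is formed from four n/2 x n/2 multipliers whose outputs are combined by three additions.

```python
def vedic_2x2(a: int, b: int) -> int:
    """Gate-level style 2x2 multiplier: AND partial products plus two half adders."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    p0 = a0 & b0                       # vertical product of the LSBs
    t1, t2 = a1 & b0, a0 & b1          # crosswise products
    p1, c1 = t1 ^ t2, t1 & t2          # half adder 1: sum and carry
    t3 = a1 & b1                       # vertical product of the MSBs
    p2, p3 = t3 ^ c1, t3 & c1          # half adder 2: sum and carry
    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3)


def vedic_multiply(a: int, b: int, n: int) -> int:
    """n x n multiplier (n a power of two, n >= 2) built from four n/2 blocks."""
    if n == 2:
        return vedic_2x2(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, (a >> h) & mask
    b_lo, b_hi = b & mask, (b >> h) & mask
    ll = vedic_multiply(a_lo, b_lo, h)
    lh = vedic_multiply(a_lo, b_hi, h)
    hl = vedic_multiply(a_hi, b_lo, h)
    hh = vedic_multiply(a_hi, b_hi, h)
    mid = lh + hl                      # adder 1: the two crosswise products
    upper = (ll >> h) + mid            # adder 2: add the upper half of ll
    upper = upper + (hh << h)          # adder 3: add the shifted high product
    return (ll & mask) | (upper << h)  # the low h bits of ll pass straight through
```

Calling `vedic_multiply(a, b, 16)` expands into 64 instances of the 2x2 block, which mirrors the hierarchy from 2x2 through 4x4 and 8x8 up to 16x16.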


Fig.3. RTL schematic diagram of CB-UT multiplier based AM operator


Consider two numbers “a” and “b” to be multiplied, with “p” as the product. Figure 3 illustrates the
steps to multiply two 2-bit binary numbers. Converted to a hardware
equivalent, the architecture has 4 AND gates, which act as single-bit multipliers, and two half adders
to add the partial products and obtain the final product. A 4x4 bit multiplier is designed using 4 such 2x2
multipliers and 3 adders, with proper instantiation of the 2x2 multipliers and
adders. Here, the adder modules are of the compressor type to improve the performance. This bottom-up
implementation is carried out until a 16x16 multiplier is integrated using four 8x8 multipliers and 3
compressor based adder modules, as shown in figure 4 [3].

Fig.4. RTL schematic diagram of 16x16 CB-UT multiplier

This 16x16 optimized CB-UT Vedic multiplier is used for the design of a high performance AM operator
by connecting it with an accumulator module, as shown in figure 5. The AM operator consists of
a CB-UT multiplier and an accumulator. The accumulator is composed of an adder and a register that stores
the result for one clock period. An extra flip-flop stores the carry of the adder for further
addition; it holds the MSB in order to avoid overflow.



Fig.5. RTL schematic diagram of CB-UT multiplier based AM operator 
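
The behaviour of the multiplier and accumulator pair can be modelled at the clock-cycle level. The sketch below makes several assumptions that are not stated in the text: a 32-bit accumulator register (the product width of a 16x16 multiplier), a single clock, a synchronous active-low clear, and the class and method names. The native `*` operator stands in for the CB-UT multiplier.

```python
class AMOperatorModel:
    """Cycle-level model: 16x16 multiply, accumulate, plus one carry flip-flop."""

    WIDTH = 32  # assumed accumulator width

    def __init__(self) -> None:
        self.acc = 0     # accumulator register
        self.carry = 0   # extra flip-flop holding the adder carry-out

    def clock(self, a: int, b: int, en: int = 1, clrb: int = 1) -> int:
        """Apply one clock edge and return the accumulator value."""
        if clrb == 0:                       # active-low clear (assumed synchronous)
            self.acc, self.carry = 0, 0
        elif en:
            product = (a & 0xFFFF) * (b & 0xFFFF)
            total = self.acc + product
            self.acc = total & ((1 << self.WIDTH) - 1)
            self.carry = (total >> self.WIDTH) & 1
        return self.acc
```

With a = 17 and b = 3, each enabled clock after the clear adds 51 to the accumulator, and the carry flip-flop records when the 32-bit register wraps around.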


IV. Results 

To evaluate the performance of the proposed CB-UT based AM operator, a comparison has been
carried out between two architectures of the AM operator: the AM operator using the compressor
based Urdhva Tiryakbhyam multiplier and the AM operator using the MB multiplier. The two
architectures were designed in Verilog HDL and implemented on a Xilinx Spartan-3E XC3S500E FPGA
using Xilinx Project Navigator ISE 13.2. The
Spartan-3E used for the experiments has a speed grade of 5 and package FG320. The simulated and
synthesized results are tabulated in Table 2.

The simulation results are used to check the functional correctness of the circuit. ‘a’ and ‘b’ are the
inputs and ‘c’ is the output. Two global clocks, ‘clk’ and ‘clk2’, are used for synchronization of the
sub-modules of the AM operator, ‘en’ is the enable signal that activates the circuit and ‘clrb’ is the
active-low clear signal. As shown in figure 6, consider ‘a’ as 17 (0000000000010001), ‘b’ as 3
(0000000000000011) and ‘en’ as ‘1’; ‘clrb’ is first assigned ‘0’ and then set to ‘1’, while ‘clk’ and
‘clk2’ are driven by two clock signals. The AM operation can then be observed together with the accumulation.
The synthesis summary reports of the AM operators using the modified Booth multiplier and the CB-UT
multiplier are shown in figure 7 and figure 8 respectively, and their technology map schematic
diagrams are shown in figure 9 and figure 10 respectively.



Device Utilization Summary (estimated values)

Logic Utilization             Used    Available    Utilization
Number of Slices               498         4656            10%
Number of Slice Flip Flops     278         9312             2%
Number of 4 input LUTs         941         9312            10%
Number of bonded IOBs          100          232            43%
Number of GCLKs                  2           24             8%


Fig.8. Synthesis summary report of the AM operator using the CB-UT multiplier

The area optimization of the CB-UT multiplier based AM operator can be evaluated by
comparing the utilization summary reports above. The MB multiplier based AM operator requires 1526
4-input LUTs, whereas the CB-UT based design requires 941 LUTs. This is a saving of 585 LUTs, which is
about 38% of the MB design or 6% of the total available LUTs. The number of slices is also reduced from
825 to 498, a saving of 327 slices, which is about 40% of the MB design or
7% of the total available slices.



Fig.9. Technology schematic diagram of modified booth multiplier based AM operator 


The technology schematic diagrams represent the placement and routing of the LUTs used by the
designed circuits. They also provide information about the input-output buffers and global clocks
along with the input-output ports.

Fig.10. Technology schematic diagram of CB-UT multiplier based AM operator

The area reduction can be observed directly from the above technology schematic maps of the AM operators
using the MB multiplier and the CB-UT multiplier.



XPower Estimator (XPE) 11.1

Device
    Part                  XC3S500E
    Package               FG320
    Grade                 Commercial
    Process               Typical

Thermal Information
    Ambient Temp (°C)     25.0
    Airflow (LFM)         250
    θJA (°C/W)            20.4
    Max Ambient (°C)      76.4
    Junction Temp (°C)    33.6

Block Summary
    Block                 Power (W)
    CLOCK                 0.037
    LOGIC                 0.024
    IO                    0.278
    BRAM                  0.000
    DCM                   0.000
    MULT                  0.000

Power Summary
    Optimization          None
    Data                  Production
    Quiescent (W)         0.081
    Dynamic (W)           0.339
    Total (W)             0.421

Voltage Source Information
    Source      Voltage    Power (W)    Icc (A)    Iccq (A)
    Vccint      1.2        0.109        0.062      0.028
    Vccaux      2.5        0.062        0.007      0.018
    Vcco 3.3    3.3        0.000        0.000      0.000
    Vcco 2.5    2.5        0.250        0.099      0.001
    Vcco 1.8    1.8        0.000        0.000      0.000
    Vcco 1.5    1.5        0.000        0.000      0.000
    Vcco 1.2    1.2        0.000        0.000      0.000

Fig.11. Power analysis using XPower Estimator.


The static power can be obtained directly from the synthesis results, but the Xilinx synthesis software
does not provide dynamic power information. For dynamic power analysis, Xilinx provides plug-in support such
as the XPower Estimator (XPE). XPower Estimator 11.1 is a Microsoft Excel spreadsheet which provides
detailed power analysis using the mapping report file generated during synthesis with the Xilinx
synthesizer. The dynamic power is 339 mW and the static power is 81 mW. The total power is 421 mW,
whereas the total power consumption of the MB multiplier based AM operator is 437 mW.


Table 2. Performance evaluation table

Parameter               MB_AM       CBUT-AM
Area (slices)           825         498
Area (LUTs)             1526        941
Delay (ns)              62.76       16.71
Power (mW)              437         421
Power x Delay (pJ)      27426.12    7034.91

Fig.12. Performance evaluation comparison of MB multiplier based AM operator and CB-UT based AM operator


The above comparison table shows that the CB-UT multiplier based architecture
optimizes the AM operator design in terms of area, power and speed performance. The
combinational delay is reduced from 62.76 ns to 16.71 ns (about 73%) by replacing the MB multiplier
with the CB-UT multiplier. The power delay product of the proposed architecture was also compared against
that of the MB multiplier based AM operator; it improves by 74.3%.
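
The figures of merit quoted above can be recomputed from the reported delay and power values. The short Python sketch below is only a convenience check; the dictionary layout and function names are illustrative. Power in mW multiplied by delay in ns gives the power delay product in pJ.

```python
# Reported values for the two AM operator designs (delay in ns, power in mW).
RESULTS = {
    "MB_AM": {"delay_ns": 62.76, "power_mw": 437.0},
    "CBUT_AM": {"delay_ns": 16.71, "power_mw": 421.0},
}


def power_delay_product(entry: dict) -> float:
    """Power delay product in pJ (mW x ns)."""
    return entry["power_mw"] * entry["delay_ns"]


def improvement_percent(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    mb, cbut = RESULTS["MB_AM"], RESULTS["CBUT_AM"]
    print("delay reduction (%):", improvement_percent(mb["delay_ns"], cbut["delay_ns"]))
    print("power reduction (%):", improvement_percent(mb["power_mw"], cbut["power_mw"]))
    print("PDP reduction (%):",
          improvement_percent(power_delay_product(mb), power_delay_product(cbut)))
```

The relative improvement in the power delay product does not depend on the unit chosen for the product, since both designs are scaled by the same factor.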


V. Conclusion 

This work focuses on optimizing the design of the Add-Multiply operator. The proposed structured
architecture for the CB-UT multiplier optimizes the performance of the multiplier in terms of both
speed and area. The CLA adder built from compressor based adders provides the accumulation in one
system clock. The compressor based UTM and the CLA adder are therefore integrated to optimize the
speed and area of the AM operator. The performance is evaluated by comparison with the MB based
AM operator using Xilinx ISE in terms of area, power and delay. This architecture of the
AM operator reduces the combinational delay from 62.76 ns to 16.71 ns and exhibits better
speed performance along with a reduction in area and power.


VI. Future Scope 

The proposed AM operator architecture can be utilized in high performance DSP processors
and microcontrollers. Even though the proposed architecture provides better speed, area and
power performance, it still has to be optimized in terms of customized placement and routing
for full-custom VLSI design. The proposed CB-UT multiplier architecture also has to be
synchronized with the system clock control by inserting a delay control and synchronization
mechanism between input and output.